Latest Open-Source Speech Model


Generative AI startup aiOla has announced on its official website that it has open-sourced its latest speech model, Whisper-Medusa, whose inference speed is 50% faster than that of OpenAI’s open-source Whisper.

aiOla modified the Whisper architecture by adopting a parallel computation method built on a “multi-head attention” mechanism, allowing the model to predict multiple tokens at each inference step without compromising performance or recognition accuracy.

Open-source URL: https://github.com/aiola-lab/whisper-medusa

Huggingface URL: https://huggingface.co/aiola/whisper-medusa-v1
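
For reference, a minimal usage sketch is shown below. It assumes the repository exposes a `WhisperMedusaModel` class and that audio features are prepared with the standard `WhisperProcessor` from Hugging Face `transformers`; the project’s actual API may differ, so treat the names and arguments as illustrative rather than authoritative.

```python
# Hypothetical usage sketch -- the actual whisper-medusa API may differ.
import torch
import torchaudio
from transformers import WhisperProcessor
from whisper_medusa import WhisperMedusaModel  # assumed package/class name

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (feature extractor + tokenizer) and the Medusa checkpoint.
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperMedusaModel.from_pretrained("aiola/whisper-medusa-v1").to(device)

# Load an audio file, resample to 16 kHz mono, and convert it to input features.
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

# Generate a transcription; multiple tokens are predicted per decoding step.
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features.to(device), language="en")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```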

Whisper-Medusa

The traditional Transformer architecture follows a sequential prediction process, generating tokens one by one. This means that when generating a new sequence, the model can only predict the next token, add it to the sequence, and then use the updated sequence to predict the following token.

Although this ensures the coherence and contextual relevance of the generated sequence, it also has a significant drawback: it greatly limits the model’s inference efficiency.

Additionally, because the model can only process one token at a time, it struggles to capture long-range dependencies in the data, potentially overlooking important global information, thus affecting the model’s overall performance and accuracy.
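
To make the bottleneck concrete, here is a simplified greedy-decoding loop of the kind a standard Transformer decoder runs, written as an illustrative PyTorch sketch (`decoder` stands in for the model’s decoder forward pass; it is not a real Whisper API call):

```python
import torch

def greedy_decode(decoder, encoder_states, start_id, end_id, max_len=448):
    """Sequential (token-by-token) greedy decoding: one decoder pass per token."""
    tokens = torch.tensor([[start_id]])
    for _ in range(max_len):
        # Each step re-runs the decoder on the whole prefix and yields
        # usable logits only for the single next position.
        logits = decoder(tokens, encoder_states)          # (1, seq_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)   # append and repeat
        if next_token.item() == end_id:
            break
    return tokens
```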


Whisper-Medusa uses a 10-head multi-head attention mechanism, in which each head computes its attention distribution independently and the inputs are processed in parallel. The heads’ outputs are then concatenated into a single multi-dimensional vector.

This vector is then passed to a fully connected layer to generate the final token predictions. This parallel approach not only improves the model’s inference efficiency but also enhances its expressive capability, since each attention head can focus on a different subset of the sequence and capture richer contextual information.
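
One simplified way to picture the extra prediction heads is sketched below. It follows the general Medusa idea of attaching several lightweight heads to the decoder state so that a single forward pass yields several token guesses; the layer sizes and head design here are illustrative assumptions, not aiOla’s published architecture.

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K lightweight prediction heads; head k guesses the token k+1 steps ahead."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (..., hidden_size) decoder state(s) at the current step.
        # Each head produces its own logits; stacking them yields candidate
        # predictions for the next `num_heads` tokens from one forward pass.
        return torch.stack([head(hidden) for head in self.heads], dim=1)

# Example: with a (batch, 1280) decoder state, MedusaHeads(1280, 51865)(state)
# returns logits of shape (batch, 10, 51865): ten token guesses per decoding step.
```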

To make the multi-head attention mechanism run more efficiently in the Whisper-Medusa model, aiOla employed a weak supervision approach. During training, they froze the main components of the original Whisper model and used the transcriptions generated by this model as pseudo-labels to train the additional token prediction module.

This allows the model to learn effective speech recognition patterns even without a large amount of manually annotated data.
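
Conceptually, the pseudo-labeling recipe could look like the following PyTorch sketch. The module names (`backbone`, `medusa_heads`), the `decode_hidden` helper, and the tensor shapes are assumptions made for illustration, not aiOla’s actual training code.

```python
import torch
import torch.nn.functional as F

def compute_medusa_loss(backbone, medusa_heads, audio_features):
    """Weakly supervised loss: the frozen Whisper backbone supplies pseudo-labels."""
    # Freeze the original Whisper weights; only the extra prediction heads train.
    for p in backbone.parameters():
        p.requires_grad = False

    # Pseudo-labels: the transcription generated by the frozen backbone itself.
    with torch.no_grad():
        pseudo_tokens = backbone.generate(audio_features)               # (batch, seq)
        hidden = backbone.decode_hidden(audio_features, pseudo_tokens)  # (batch, seq, d), assumed helper

    # Head k learns to predict the token k+1 positions ahead of each decoder state.
    logits = medusa_heads(hidden)                                       # (batch, K, seq, vocab)
    loss = 0.0
    for k in range(logits.size(1)):
        targets = pseudo_tokens[:, k + 1 :]                             # tokens k+1 steps ahead
        head_logits = logits[:, k, : targets.size(1)]
        loss = loss + F.cross_entropy(
            head_logits.reshape(-1, head_logits.size(-1)), targets.reshape(-1)
        )
    return loss / logits.size(1)
```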

Furthermore, Whisper-Medusa’s training loss has to balance prediction accuracy against efficiency. On the one hand, the model must keep the predicted token sequence as consistent as possible with the actual transcription; on the other hand, the parallel predictions of the multi-head attention mechanism encourage the model to decode as quickly as possible without sacrificing accuracy.

aiOla used various techniques such as learning rate scheduling, gradient clipping, and regularization to ensure stable convergence of the model during training while avoiding overfitting.
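
In PyTorch terms, these stabilization techniques typically combine as in the sketch below, which reuses the names from the loss sketch above; the hyperparameter values are placeholders rather than aiOla’s published settings.

```python
import torch

# Illustrative setup; learning rate, weight decay, and schedule length are placeholders.
optimizer = torch.optim.AdamW(medusa_heads.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

for audio_features in train_loader:                    # assumed DataLoader of audio features
    loss = compute_medusa_loss(backbone, medusa_heads, audio_features)  # loss sketch above
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping keeps rare, very large gradients from destabilizing training.
    torch.nn.utils.clip_grad_norm_(medusa_heads.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                                   # learning-rate schedule step
```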


In terms of business scenarios, Whisper-Medusa can understand more than 100 languages, enabling users to build applications such as audio transcription and recognition for industries like translation, finance, tourism, logistics, and warehousing.

aiOla stated that it plans to expand Whisper-Medusa’s multi-head attention mechanism to 20 heads in the future, further boosting its inference efficiency.
